Mining the Web for lists of Named Entities

Authors

  • Arlind Kopliku
  • Mohand Boughanem
  • Karen Pinel-Sauvagnat
Abstract

Named entities play an important role in Information Extraction. They represent unitary, namable pieces of information within text. In this work, we focus on groups of named entities of the same type, which we try to extract from HTML lists. Instead of starting from a class and identifying the corresponding named entities, we explore a new paradigm that consists of identifying sets of named entities without any knowledge of their class. A clear advantage of the approach is that it applies to all named entities, whatever their class, which makes it domain independent.

We use HTML lists to collect candidate sets of named entities. To evaluate the approach, human assessors judged a randomly selected sample of HTML lists from the Web: 8.25% of these lists are lists of named entities of the same class. If this estimate is validated at large scale, at least 890 million such lists of named entities can be expected in the indexed Web alone. Moreover, we propose a classifier for identifying these lists, which shows promising first results.
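The candidate-collection step mentioned in the abstract (gathering the items of each HTML list as a candidate set) could look roughly like the sketch below. It is only an illustration under stated assumptions, not the authors' implementation: the names ListCollector and extract_candidate_lists and the min_items threshold are hypothetical, and the classifier that decides whether a candidate set really contains named entities of one class is not shown, since the abstract does not describe it.

```python
from html.parser import HTMLParser


class ListCollector(HTMLParser):
    """Collect the item texts of every <ul>/<ol> in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.lists = []    # finished lists, each a list of item strings
        self._stack = []   # one dict per currently open <ul>/<ol>

    def _flush_item(self):
        # Close the item currently being read in the innermost open list.
        top = self._stack[-1]
        if top["buf"] is not None:
            text = " ".join("".join(top["buf"]).split())  # normalize whitespace
            if text:
                top["items"].append(text)
            top["buf"] = None

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            self._stack.append({"items": [], "buf": None})
        elif tag == "li" and self._stack:
            self._flush_item()              # tolerate a missing </li>
            self._stack[-1]["buf"] = []

    def handle_endtag(self, tag):
        if not self._stack:
            return
        if tag == "li":
            self._flush_item()
        elif tag in ("ul", "ol"):
            self._flush_item()
            finished = self._stack.pop()
            if finished["items"]:
                self.lists.append(finished["items"])

    def handle_data(self, data):
        # Text belongs to the innermost open list, and only while inside an <li>.
        if self._stack and self._stack[-1]["buf"] is not None:
            self._stack[-1]["buf"].append(data)


def extract_candidate_lists(html, min_items=2):
    """Return the HTML lists with at least min_items items (assumed threshold)."""
    collector = ListCollector()
    collector.feed(html)
    collector.close()
    return [items for items in collector.lists if len(items) >= min_items]


page = "<ul><li>Paris</li><li>Rome</li></ul><ol><li>Only one</li></ol>"
candidates = extract_candidate_lists(page)
```

Each returned list is one candidate set. Nested lists are reported separately from their parent, so a sub-list does not leak its text into the enclosing item.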

Similar resources

Learning Weighted Entity Lists from Web Click Logs for Spoken Language Understanding

Named entity lists provide important features for language understanding, but typical lists can contain many ambiguous or incorrect phrases. We present an approach for automatically learning weighted entity lists by mining user clicks from web search logs. The approach significantly outperforms multiple baseline approaches and the weighted lists improve spoken language understanding tasks such ...

Presenting a method for extracting structured domain-dependent information from Farsi Web pages

Extracting structured information about entities from web texts is an important task in web mining, natural language processing, and information extraction. Information extraction is useful in many applications including search engines, question-answering systems, recommender systems, machine translation, etc. An information extraction system aims to identify the entities from the text and extr...

A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features

Named Entity Recognition is an information extraction technique that identifies name entities in a text. Three popular methods have been conventionally used namely: rule-based, machine-learning-based and hybrid of them to extract named entities from a text. Machine-learning-based methods have good performance in the Persian language if they are trained with good features. To get good performanc...

Mining Implicit Entities in Queries

Entities are pivotal in describing events and objects, and also very important in Document Summarization. In general only explicit entities which can be extracted by a Named Entity Recognizer are used in real applications. However, implicit entities hidden behind the phrases or words, e.g. entity referred by the phrase “cross border”, are proved to be helpful in Document Summarization...

OntoWM: An Ontology for Unification and Description of Web Mining

This article is concerned with the merging of two active research domains: Knowledge Discovery in Databases (KDD) and Knowledge Engineering (KE) with a main interest in Ontology. In KDD, we need to unify the domain of web mining. To overcome this drawback, several methods have been proposed in the literature. So, we propose an ontology, named OntoWM which includes definitions of basic Web Minin...

Publication date: 2011